ABSTRACT

Community discovery in complex networks is an interest- 
ing problem with a number of applications, especially in 
the knowledge extraction task in social and information net- 
works. However, many large networks often lack a particular 
community organization at a global level. In these cases, tra- 
ditional graph partitioning algorithms fail to let the latent 
knowledge embedded in modular structure emerge, because 
they impose a top-down global view of a network. We pro- 
pose here a simple local-first approach to community dis- 
covery, able to unveil the modular organization of real com- 
plex networks. This is achieved by democratically letting 
each node vote for the communities it sees surrounding it in 
its limited view of the global system, i.e. its ego neighbor- 
hood, using a label propagation algorithm; finally, the local 
communities are merged into a global collection. We tested 
this intuition against the state-of-the-art overlapping and 
non-overlapping community discovery methods, and found 
that our new method clearly outperforms the others in the 
quality of the obtained communities, evaluated by using the 
extracted communities to predict the metadata about the 
nodes of several real world networks. We also show how our 
method is deterministic, fully incremental, and has a lim- 
ited time complexity, so that it can be used on web-scale 
real networks. 

Categories and Subject Descriptors 

I.5.3 [Clustering]: Algorithms

1. INTRODUCTION

Complex network analysis has emerged as one of the most
exciting domains of data analysis and mining over the last
decade. One of the most prolific subfields is community
discovery in complex networks, or CD for short. The concept
of a "community" in a (web, social, or informational) network
is intuitively understood as a set of individuals that are
more similar, or closer, to each other than to anybody
outside the community [6]. This has often been translated in
network terms into finding sets of nodes densely connected
to each other and sparsely connected with the rest of the
network. Community discovery can be seen as a network
variant of traditional data clustering. Detecting these
structures efficiently is very useful for a number of
applications, ranging from targeted vaccinations and outbreak
prevention [23], to viral marketing [16], to many web data
analysis tasks such as finding tribes in online information
exchanges [12, 25], data compression, clustering [2] and
sampling [14].

The classical problem definition of community discovery
finds a very intuitive counterpart for small networks, where
the denser areas are easily identifiable by visual inspection,
while the problem becomes much harder for medium and
large scale networks. At the global level, very little can be
said about the modular structure of the network, because on
larger scales the organization of the system simply becomes
too complex. The friendship graph of Facebook includes
more than 845 million nodes as of February 20 but the
difficulty of the CD task can be appreciated even considering
a tiny fragment of the Facebook friendship graph, illustrated
in Figure 1(a). We depicted the connections among 15,000
nodes, i.e., less than 0.002% of the total network. Even in
this small subset of the network, no evident organization can
be identified easily. Big networks cannot be analyzed with
the naked eye. Very often, a visualization of ten thousand
nodes results in a structureless hairball. In cases like this,
generic community discovery algorithms also tend to return
communities that are not meaningful, as they typically try to
cluster the whole structure and return some huge communities
and a long list of small branches (see [6]). Often,
superimposing an order with a top-down approach leads to
failure.

By contrast, human eyes are good at finding denser areas
in simple networks, i.e., the structure of cohesive groups
of nodes that emerges when considering a local fragment of
an otherwise big network. But what does local mean? Common
sense suggests that people are good at identifying the reasons
why they know the people they know; therefore, each node

1 http://newsroom.fb.com/content/default.aspx?NewsAreaId



(a) A global view of the Facebook
graph from 15k users.



(b) The "ego minus ego" network of 
one Facebook user among the 15k. 



Figure 1: The real world example of the "local vs global" structure intuition. 



presumably has an incomplete, yet clear, vision of the social
communities it is part of and that surround it. The
consequences of exploiting this idea for the CD problem are
effectively illustrated by Figure 1(b). Here, we chose one
of the 15k nodes from the previous example and extracted
what we call its "ego minus ego" network, i.e. its ego
network in which the ego node has been removed, together with
all its attached edges. Suddenly, everything around the ego
makes sense and some groups can be easily spotted. These
groups correspond to high school and university friends,
mates from different workplaces and the members of an online
community (we know all these details because the chosen
ego is one of the authors of this paper). The ego is part of
all these communities and knows that particular subsets of
its neighborhood are part of these communities too.
Probably, different egos have different perspectives over the
same neighbors, and it is the union of all these perspectives
that creates an optimal partition of the network. In other
words: if node A and node B are considered to be in the same
communities by all the nodes connected to both A and B, then
they should be grouped in the same community. This is achieved
by a democratic bottom-up mining approach: in turn, each
node gives its perspective on the communities surrounding
it, and then all the different perspectives are merged
together in an overlapping structure.

In the vast CD literature, the general approach for the
detection of the modular structure of a network is usually
to develop a particular (greedy) algorithm, testing a general
quality function with a particular heuristic, and then return
a set of communities extracted from the global structure (we
discuss some of these methods in Section 2). This approach
generally fails for large networks due to the difference in
structural organization at global and local scale. To cope
with this difficulty, we propose a change of mentality. Since
our community definition works perfectly at the small scale,
it should be applied only at this small scale. We propose
a simple local-first approach to community discovery
in complex networks by letting the hidden modular
organization of a network emerge from local patterns.

Essentially, we adopt a democratic approach to the
discovery of communities in complex networks. We ask each
node to vote for the communities present in its local view
of the network. For this reason, we chose to name our
algorithm Democratic Estimate of the Modular Organization
of a Network, or DEMON for short. In practice, we extract
the ego network of each node and apply a Label Propagation
CD algorithm on this structure, ignoring the presence
of the ego itself, which will be judged by its peer neighbors.
We then combine, with equity, the votes of everyone in the
network. The result of this combination is a set of
(overlapping) modules, the guess of the real communities in
the global system, made not by an external observer, but by
the actors of the network itself. Our democratic algorithm is
incremental, allowing us to recompute the communities only
for the newly incoming nodes and edges in an evolving network.
Moreover, DEMON also has a low theoretical linear time
complexity. The core of our method also has the interesting
property of being easily parallelizable, since only the
ego network information is needed to perform independent
computations, and it can be easily implemented in a
MapReduce framework [7]. The post-process Merge procedure,
however, is not trivially solvable in a MapReduce framework,
and for this reason we leave a discussion of the parallel
implementation as future work. The properties of DEMON
support its use in massive real world scenarios.

We provide an extensive empirical validation of DEMON.
In our experimental setting, we are particularly interested
in investigating what useful knowledge we can discover. We
test the results obtained with our method against selected
state-of-the-art algorithms, both overlapping and
non-overlapping, since we believe that the possibility to
cluster the same nodes in different communities is one of the
crucial properties that a community discoverer should allow:
online social networks have proved that individuals are part
of many different communities and groups of interest. To
evaluate this knowledge, we make use of a multilabel predictor
fed with the extracted communities as input, with the aim of
correctly classifying the metadata attached to the nodes in
real life. Our datasets include the international store
Amazon, the database of collaborations in the movie industry
IMDb, and the register of the activities of the US Congress
GovTrack.us.

The rest of the paper is organized as follows: in Section 2
we present related work in the community discovery literature.
Section 3 is dedicated to the problem representation and
definition. Section 4 describes the DEMON algorithm
structure, with algorithmic details and an account of the
formal properties of the method. Our experiments are presented
in Section 5, and finally Section 6 concludes the paper.

2. RELATED WORK 

The problem of finding communities in complex networks 
is very popular among network scientists, as witnessed by an 
impressive number of valid works in this field. A huge survey
by Fortunato [5] explores all the most popular techniques
to find communities in complex networks. Traditionally, a 
community is defined as a dense subgraph, in which the 
number of edges among the members of the community is 
significantly higher than the outgoing edges. However, this 
definition does not cover many real world scenarios, and over
the years many different solutions started to explore
alternative definitions of communities in complex networks [6].

A variety of CD methods are based on the modularity
concept, a quality function of a partition proposed by
Newman [5, 18]. Modularity is high for partitions in
which the internal cluster density is higher than the external
density. Hundreds of papers have been written about
modularity, either using it as a quality function to be
optimized, or studying its properties and deficiencies. One of
the most advanced examples of modularity maximization CD is
[17], where the authors use an extension of the modularity
formula to cluster multiplex (evolving and/or multirelational)
networks. A fast and efficient greedy algorithm, Modularity
Unfolding, has been successfully applied to the analysis of
huge web graphs of millions of nodes and billions of edges,
representing the structure in a subset of the WWW [3].

Many algorithms have been proposed that are unrelated
to modularity. Among them, a particularly important field
is the application of information theory techniques, as for
example in Infomap [22] or Cross Associations [19]. In
particular, Infomap has been shown to be among the best
performing non-overlapping algorithms [15]. For this reason
we chose Infomap as a baseline method alternative to
modularity approaches. Further, modularity approaches are
affected by known issues, namely the resolution problem and
the degeneracy of good solutions [10]. Similarly to Infomap,
Walktrap [20] is based on flow methods and random walks.

A very important property for community discovery is the
ability to return overlapping partitions, i.e., the possibility
for a node to be part of more than one community. This
property reflects the common sense intuition that each of us
is part of many different communities, including family, work,
and probably many hobby-related communities. Specific
algorithms developed around this property are Hierarchical
Link Clustering [1], HCDF [13] and k-clique percolation [8].

Finally, an important approach is known as Label
Propagation [21]: in this work the authors detect communities
by spreading labels through the edges of the graph and then
labeling nodes according to the majority of the labels
attached to their neighbors, iterating until a general
consensus is reached. With a reasonably good partition
quality, this algorithm is extremely fast and known to be one
of the very few quasi-linear solutions to the community
discovery



problem, even if its plain application leads to worse results 
than Infomap and it does not return an overlapping parti- 
tion. A related work is also 1 2 , whose aim is also to discover 
local communities. However, the authors are only interested in
those local communities and do not return any global
modular organization.

Extracting useful knowledge from the modular structure
of networked data is also a prolific track of research. We
recall the GuruMine framework, whose aim is to identify
leaders in information spread and to detect groups of users
that are usually influenced by the same leaders [12]. Many
other works investigate the possibility of applying network
analysis for studying, for instance, the dynamics of viral

3. NETWORKS AND COMMUNITIES

We model networks and their properties in terms of simple
graphs. For the sake of simplicity, a network is represented
as an undirected, unlabeled and unweighted simple graph,
denoted by $\mathcal{G} = (V, E)$, where $V$ is a set of nodes
and $E$ is a set of edges, i.e., pairs $(u, v)$ representing
the fact that there is a link in the network connecting nodes
$u$ and $v$. It should be noted, however, that our method can
handle weighted, directed and labeled multi-graphs.

In general terms, our problem definition is to find
communities in complex networks. However, this is an ambiguous
goal, as the definition itself of "community" in a complex
network, similarly to the notion of clustering in statistics
and data mining, is not unique [6]. Furthermore, in a complex
and semantically rich setting such as the modern Web, one
may want to cluster many different kinds of objects for many
different reasons. Therefore, we need to narrow down our
problem definition as follows.

We define two basic graph operations. The first one is
the Ego Network extraction $EN$. Given a graph $\mathcal{G}$
and a node $v \in V$, $EN(v, \mathcal{G})$ is the subgraph
$\mathcal{G}'(V', E')$, where $V'$ is the set containing $v$
and all its neighbors in $E$, and $E'$ is the subset of $E$
containing all the edges $(u, v)$ where
$u \in V' \wedge v \in V'$. The second operation is the
Graph-Vertex Difference $-g$: $-g(v, \mathcal{G})$ results in
a copy of $\mathcal{G}$ without the vertex $v$ and all edges
attached to $v$. The combination of these two functions yields
the EgoMinusEgo function:
$EgoMinusEgo(v, \mathcal{G}) = -g(v, EN(v, \mathcal{G}))$.
Given a graph $\mathcal{G}$ and a node $v \in V$, the set of
local communities $\mathcal{C}(v)$ of node $v$ is a set of
(possibly overlapping) sets of nodes in
$EgoMinusEgo(v, \mathcal{G})$, where each set
$C \in \mathcal{C}(v)$ is a community according to node
similarity: each node in $C$ is more similar to any other node
in $C$ than to any other node in $C' \in \mathcal{C}(v)$, with
$C \neq C'$. Finally, we define the set of global
communities, or simply communities, of a graph $\mathcal{G}$
as:



$$\mathcal{C} = Max\Big(\bigcup_{v \in V} \mathcal{C}(v)\Big) \qquad (1)$$



where, given a set of sets $\mathcal{S}$, $Max(\mathcal{S})$
denotes the subset of $\mathcal{S}$ formed by its maximal sets
only; namely, every set $S \in \mathcal{S}$ such that there is
no other set $S' \in \mathcal{S}$ with $S \subset S'$.
In other words, by equation (1) we generalize from local to
global communities by selecting the maximal local communities
that cover the entire collection of local communities,
each found in the EgoMinusEgo network of each individual
node.
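
A minimal Python sketch of the two graph operations and of their combination may help to fix ideas. It assumes an undirected graph stored as a dictionary of adjacency sets; this representation, the function names and the toy graph are illustrative assumptions, not taken from the original implementation.

```python
from typing import Dict, Hashable, Set

# Assumed representation: undirected graph as a dict of adjacency sets.
Graph = Dict[Hashable, Set[Hashable]]


def ego_network(v: Hashable, graph: Graph) -> Graph:
    """EN(v, G): the subgraph induced by v and all its neighbors."""
    nodes = {v} | graph[v]
    return {u: graph[u] & nodes for u in nodes}


def graph_vertex_difference(v: Hashable, graph: Graph) -> Graph:
    """-g(v, G): a copy of G without v and without the edges attached to v."""
    return {u: nbrs - {v} for u, nbrs in graph.items() if u != v}


def ego_minus_ego(v: Hashable, graph: Graph) -> Graph:
    """EgoMinusEgo(v, G) = -g(v, EN(v, G))."""
    return graph_vertex_difference(v, ego_network(v, graph))


# Hypothetical toy graph: "ego" knows two separate groups, {a, b} and {c, d};
# "x" is linked to "c" only, so it lies outside the ego network of "ego".
toy = {
    "ego": {"a", "b", "c", "d"},
    "a": {"ego", "b"},
    "b": {"ego", "a"},
    "c": {"ego", "d", "x"},
    "d": {"ego", "c"},
    "x": {"c"},
}
local_view = ego_minus_ego("ego", toy)
```

In this toy example, removing the ego leaves two disconnected pairs, which is the kind of structure the local community step is expected to pick up.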



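Algorithm 1 The pseudo-code of the DEMON algorithm.

Require: $\mathcal{G} : (V, E)$; $\mathcal{C} = \emptyset$; $\epsilon \in [0..1]$
Ensure: set of overlapping communities $\mathcal{C}$

1: for all $v \in V$ do
2:   $e \leftarrow EgoMinusEgo(v, \mathcal{G})$
3:   $\mathcal{C}(v) \leftarrow LabelPropagation(e)$
4:   for all $C \in \mathcal{C}(v)$ do
5:     $C \leftarrow C \cup v$
6:     $\mathcal{C} \leftarrow Merge(\mathcal{C}, C, \epsilon)$
7:   end for
8: end for
9: return $\mathcal{C}$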




Figure 2: A simple simulation of the Label Propagation pro- 
cess for community discovery. 



4. THE ALGORITHM 

In this section we present our solution to the community
discovery problem. The pseudo-code of DEMON is specified
in Algorithm 1.
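
The outer loop of the pseudo-code can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is a dictionary of adjacency sets without self-loops, and the local community detector and the merge step are passed in as functions. The two stand-ins used in the example (`connected_components` in place of Label Propagation, `keep_maximal` in place of Merge, ignoring the threshold) are hypothetical simplifications chosen to keep the sketch short and deterministic.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, List, Set

Graph = Dict[Hashable, Set[Hashable]]
Community = FrozenSet[Hashable]


def demon(
    graph: Graph,
    local_communities: Callable[[Graph], Iterable[Set[Hashable]]],
    merge: Callable[[List[Community], Community, float], List[Community]],
    epsilon: float,
) -> List[Community]:
    communities: List[Community] = []            # result set, initially empty
    for v in graph:                              # Step 1: every node votes
        nodes = graph[v]                         # neighbors of v, ego excluded
        ego_minus_ego = {u: graph[u] & nodes for u in nodes}     # Step 2
        for local in local_communities(ego_minus_ego):           # Steps 3-4
            candidate = frozenset(local) | {v}                   # Step 5
            communities = merge(communities, candidate, epsilon)  # Step 6
    return communities


def connected_components(g: Graph) -> List[Set[Hashable]]:
    """Stand-in for Label Propagation (assumption made for this sketch)."""
    seen: Set[Hashable] = set()
    components = []
    for start in g:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(g[u] - component)
        seen |= component
        components.append(component)
    return components


def keep_maximal(communities, candidate, epsilon):
    """Stand-in for Merge: keeps only maximal sets and ignores epsilon."""
    if any(candidate <= c for c in communities):
        return communities
    return [c for c in communities if not c <= candidate] + [candidate]


# Hypothetical example: two triangles sharing node 3.
two_triangles = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
result = demon(two_triangles, connected_components, keep_maximal, 0.0)
```

On this example the shared node 3 ends up in both communities, which illustrates how overlapping modules arise from merging local views.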

4.1 The Core of the Algorithm 

The set of discovered communities $\mathcal{C}$ is initially
empty. The external (explicit) loop of DEMON cycles over each
individual node; this is necessary to generate all the
possible points of view of the structure and get a complete
coverage of the network itself. For each node $v$, we apply
the $EgoMinusEgo(v, \mathcal{G})$ operation defined in
Section 3, obtaining a graph $e$. We cannot simply apply the
ego network extraction $EN(v, \mathcal{G})$, because the ego
node $v$ is directly linked to all nodes in
$EN(v, \mathcal{G})$. This would lead to noise in the
subsequent steps of DEMON, since by our definition of local
community nodes are put in the same community
if they are close to each other. Obviously, a single node
connecting the entire subgraph makes all nodes very close,
even if they are not in the same community. For this reason,
we remove the ego from its own ego network.

Once we have the graph $e$, the next step is to compute the
communities contained in $e$. We chose to perform this step
by using a community discovery algorithm borrowed from
the literature. Our choice fell on the Label Propagation (LP)
algorithm [21]. This choice has been made for the following
reasons:

1. LP shares with this work the definition of what is a
community.

2. LP is known as the least complex algorithm in the
literature, reaching a quasi-linear time complexity in
terms of nodes.

3. Despite its low complexity, LP returns results of a
quality comparable to more complex algorithms.

Reason #2 is particularly important, since Step #3 of
our pseudo-code needs to be performed once for every node
of the network. It is unacceptable to spend superlinear
time for each node at this stage if we want to scale up to
millions of nodes and hundreds of millions of edges. Given the
linear complexity of Step #3, we refer to this as the internal
(implicit) loop for finding the local communities.

We briefly describe the LP algorithm in more detail, given
its importance in the DEMON algorithm, following the
original article [21]. Suppose that a node $v$ has neighbors
$v_1, v_2, \ldots, v_k$ and that each neighbor carries a label
denoting the community that it belongs to. Then $v$ determines
its community based on the labels of its neighbors. A
three-step example of this principle is shown in Figure 2. The
authors assume that each node in the network chooses to join
the community to which the maximum number of its neighbors
belong. As the labels propagate, densely connected groups of
nodes quickly reach a consensus on a unique label. At the end
of the propagation process, nodes with the same labels are
grouped together as one community. Clearly, a node with an
equal maximum number of neighbors in two or more
communities can belong to both communities, thus identifying
possible overlapping communities. The original algorithm
does not handle this situation. For clarity, we report here
the procedure of the LP algorithm, which is the expansion of
Step #3 of Algorithm 1 and represents our inner loop:

1. Initialize the labels at all nodes in the network. For
any given node $v$, $C_v(0) = v$.

2. Set $t = 1$.

3. Arrange the nodes in the network in a random order
and set it to $V$.

4. For each $v_i \in V$, in the specific order, let
$C_{v_i}(t) = f(C_{v_{i1}}(t-1), \ldots, C_{v_{ik}}(t-1))$.
Here $f$ returns the label occurring with the highest
frequency among neighbors, and ties are broken uniformly
at random.

5. If every node has a label that the maximum number
of its neighbors have, or $t$ hits a maximum number
of iterations $t_{max}$, then stop the algorithm. Else, set
$t = t + 1$ and go to (3).
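
The five steps above can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in the experiments: it assumes a dictionary of adjacency sets, it updates labels in place during each sweep (an asynchronous variant, whereas step 4 is written in terms of the labels at $t - 1$), and the names `label_propagation`, `max_iter` and `seed` are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def label_propagation(
    graph: Graph, max_iter: int = 100, seed: Optional[int] = None
) -> List[Set[Hashable]]:
    rng = random.Random(seed)
    labels = {v: v for v in graph}            # step 1: C_v(0) = v
    order = list(graph)

    def top_labels(v):
        """Labels occurring with the highest frequency among v's neighbors."""
        counts = Counter(labels[u] for u in graph[v])
        best = max(counts.values())
        return [label for label, count in counts.items() if count == best]

    for _ in range(max_iter):                 # steps 2 and 5: at most t_max sweeps
        rng.shuffle(order)                    # step 3: random visiting order
        for v in order:                       # step 4: adopt a most frequent label
            if graph[v]:                      # isolated nodes keep their own label
                labels[v] = rng.choice(top_labels(v))   # ties broken at random
        # step 5: stop once every node carries one of its most frequent labels
        if all(not graph[v] or labels[v] in top_labels(v) for v in graph):
            break

    groups: Dict[Hashable, Set[Hashable]] = {}
    for v, label in labels.items():
        groups.setdefault(label, set()).add(v)
    return list(groups.values())


# Hypothetical example: two disconnected triangles.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
found = label_propagation(example, seed=0)
```

Because labels only travel along edges, a community returned by this sketch never spans two disconnected components; the random tie-breaking means that on less clear-cut graphs different seeds may return different partitions.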

4.2 The Merge Function 

The result of Step #3 of Algorithm 1 is a set of local
communities, according to the perspective of node $v$: at
the end of the LP algorithm we reintroduce the node $v$ in
each local community. These communities are likely to
be incomplete and should be used to enrich what DEMON
has discovered so far. Thus, the next step is to merge
each local community of $\mathcal{C}(v)$ into the result set.
The Merge operation is defined as follows.

Two communities $C$ and $I$ are merged if and only if at
most a fraction $\epsilon$ of the smaller one is not included
in the bigger one; in this case, $C$ and $I$ are removed from
$\mathcal{C}$ and their union is added to the result set. The
$\epsilon$ factor is introduced to vary the percentage of
common elements required from each pair of communities:
$\epsilon = 0$ ensures that two communities are merged only if
one of them is a subset of the other, while with
$\epsilon = 1$ even communities that do not share a single
node are merged together.



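Algorithm 2 The pseudo-code of the Merge function.

Require: $\mathcal{C}$ = Community set; $C$ = Community; $\epsilon \in [0..1]$
Ensure: set of overlapping communities $\mathcal{C}$

1: for all $I \in \mathcal{C}$ do
2:   if $C.size < I.size$ and $C \subseteq_{\epsilon} I$ then
3:     $u = C \cup I$;
4:     $\mathcal{C} - C$; $\mathcal{C} - I$;
5:     $\mathcal{C} = \mathcal{C} \cup u$;
6:   end if
7: end for
8: return $\mathcal{C}$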
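
One possible reading of the Merge rule is sketched below in Python. Several choices here are assumptions made for illustration and go beyond the pseudo-code: the containment test is applied symmetrically (smaller versus bigger, whichever is which), a candidate that merges with nothing is still added to the result, and a single pass is made over the existing communities. The names `should_merge` and `merge` are hypothetical.

```python
from typing import FrozenSet, Hashable, List

Community = FrozenSet[Hashable]


def should_merge(a: Community, b: Community, epsilon: float) -> bool:
    """True if at most a fraction epsilon of the smaller set lies outside the bigger."""
    smaller, bigger = (a, b) if len(a) <= len(b) else (b, a)
    missing = len(smaller - bigger)
    return missing <= epsilon * len(smaller)


def merge(
    communities: List[Community], candidate: Community, epsilon: float
) -> List[Community]:
    kept: List[Community] = []
    for existing in communities:
        if should_merge(candidate, existing, epsilon):
            candidate = candidate | existing   # absorb; `existing` is dropped
        else:
            kept.append(existing)
    kept.append(candidate)   # assumption: an unmerged candidate is still added
    return kept


base = [frozenset({1, 2, 3})]
absorbed = merge(base, frozenset({1, 2}), 0.0)      # subset: merged
separate = merge(base, frozenset({3, 4, 5}), 0.0)   # partial overlap: kept apart
fused = merge(base, frozenset({3, 4, 5}), 1.0)      # epsilon = 1: always merged
```

Note that a single pass does not re-check earlier communities after the candidate has grown; whether that matters depends on the chosen threshold and is left open in this sketch.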



4.3 DEMON Properties 

To prove the correctness of the DEMON algorithm w.r.t.
the problem definition in Section 3, we prove by induction
that the following holds:

Property 1. At the $k$-th iteration of the outer loop of
DEMON, for all $k \geq 0$:

$$\mathcal{C} = Max\Big(\bigcup_{v = v_1, \ldots, v_k} \mathcal{C}(v)\Big) \qquad (2)$$

where $v_1, \ldots, v_k$ are the nodes visited after $k$
iterations.

Property 1 trivially holds for $k = 0$, i.e., at the
initialization stage. For $k > 0$, assume that the property
holds up to $k - 1$. Then $\mathcal{C}$ contains the maximal
local communities of the subgraph with nodes
$v_1, \ldots, v_{k-1}$. By merging every local community $C$
of node $v_k$ into $\mathcal{C}$, we guarantee that $C$ is
added to the result only if it is not covered by any
preexisting community and, if added, any preexisting community
covered by $C$ is removed from $\mathcal{C}$. As a result,
after merging all communities in $\mathcal{C}(v_k)$ into
$\mathcal{C}$ in Steps #4-6, the latter is the set of maximal
communities covering all local communities discovered in
$v_1, \ldots, v_k$. Therefore, we can conclude
that DEMON is a correct and complete implementation of
the CD problem stated by equation (1). More generally,
denoting by $DEMON(\mathcal{G}, \mathcal{C})$ the set of
communities $\mathcal{C}'$ obtained by running the DEMON
algorithm on graph $\mathcal{G}$ starting with the (possibly
non-empty) set of communities $\mathcal{C}$, the following
properties hold.

Property 2. Correctness and Completeness.

If $DEMON(\mathcal{G}, \mathcal{C}) = \mathcal{C}'$, where
$\mathcal{G} = (V, E)$, then

$$\mathcal{C}' = Max\Big(\mathcal{C} \cup \bigcup_{v \in V} \mathcal{C}(v)\Big) \qquad (3)$$

In other words, given a preexisting set of communities
$\mathcal{C}$ and a graph $\mathcal{G}$, DEMON returns all and
only the communities obtained by extending $\mathcal{C}$ with
the communities found in $\mathcal{G}$, coherently with the
definition of communities given in equation (1).

Property 3. Determinacy and Order insensitivity.

There exists a unique
$\mathcal{C}' = DEMON(\mathcal{G}, \mathcal{C})$ for any given
$\mathcal{G}$ and $\mathcal{C}$, regardless of the order of
visit of the nodes in $\mathcal{G}$.

This is a direct corollary of Property 2 and of the
uniqueness of the set $Max(\mathcal{S})$ for any set of sets
$\mathcal{S}$, under the assumption that the set of local
communities $\mathcal{C}(v)$ is also uniquely assigned, for
any node $v$. Therefore, the order in which the nodes in
$\mathcal{G}$ are visited by DEMON is irrelevant.



Property 4. Compositionality. Consider any partition of a
graph $\mathcal{G}$ into two subgraphs
$\mathcal{G}_1, \mathcal{G}_2$ such that, for any node $v$ of
$\mathcal{G}$, the entire ego network of $v$ in $\mathcal{G}$
is fully contained either in $\mathcal{G}_1$ or
$\mathcal{G}_2$. Then, given an initial set of communities
$\mathcal{C}$:

$$DEMON(\mathcal{G}_1 \cup \mathcal{G}_2, \mathcal{C}) = Max(DEMON(\mathcal{G}_1, \mathcal{C}) \cup DEMON(\mathcal{G}_2, \mathcal{C})) \qquad (4)$$

This is a consequence of two facts: i) each local community
$\mathcal{C}(v)$ is correctly computed under the assumption
that the subgraphs do not split any ego network, and ii) for
any two sets of sets $\mathcal{S}_1, \mathcal{S}_2$,
$Max(\mathcal{S}_1 \cup \mathcal{S}_2) = Max(Max(\mathcal{S}_1) \cup Max(\mathcal{S}_2))$.
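
Fact ii) can be checked on a small example. The Python sketch below implements $Max$ as the set of members that are not strictly contained in another member; the function name `maximal_sets` and the two example collections are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, Hashable, Iterable, Set


def maximal_sets(sets: Iterable[Iterable[Hashable]]) -> Set[FrozenSet[Hashable]]:
    """Max(S): the members of S that are not a proper subset of another member."""
    unique = {frozenset(s) for s in sets}
    return {s for s in unique if not any(s < t for t in unique)}


# Two hypothetical collections of local communities.
s1 = [{1, 2}, {1, 2, 3}, {4}]
s2 = [{4, 5}, {3}]

direct = maximal_sets(s1 + s2)                                   # Max(S1 u S2)
composed = maximal_sets(maximal_sets(s1) | maximal_sets(s2))     # Max(Max(S1) u Max(S2))
```

Both computations keep only {1, 2, 3} and {4, 5}: taking the maximal sets of each part first discards nothing that could have survived in the union.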

Property 5. Incrementality. Given a graph $\mathcal{G}$, an
initial set of communities $\mathcal{C}$ and an incremental
update $\Delta\mathcal{G}$ consisting of new nodes and new
edges added to $\mathcal{G}$, where $\Delta\mathcal{G}$
contains the entire ego networks of all new nodes and of
all the preexisting nodes reached by new links, then

$$DEMON(\mathcal{G} \cup \Delta\mathcal{G}, \mathcal{C}) = DEMON(\Delta\mathcal{G}, DEMON(\mathcal{G}, \mathcal{C})) \qquad (5)$$

This is a consequence of the fact that only the local
communities of nodes in $\mathcal{G}$ affected by new links
need to be reexamined, so we can run DEMON on
$\Delta\mathcal{G}$ only, without running it from scratch on
$\mathcal{G} \cup \Delta\mathcal{G}$.

Properties 4 and 5 have important computational
repercussions. The compositionality property entails that the
core of the DEMON algorithm as described in subsection 4.1
is highly parallelizable, because it can run independently
on different fragments of the overall network with a
relatively small combination work. Each node of the computer
cluster needs to obtain a small fragment of the network,
as small as the ego network of one or a few nodes. The
Map function is simply the LP algorithm. The incrementality
property entails that DEMON can efficiently run in
a streamed fashion, considering incremental updates of the
graph as they arrive in subsequent batches; essentially,
incrementality means that it is not necessary to run DEMON
from scratch as batches of new nodes and new links arrive:
the new communities can be found by considering only the
ego networks of the nodes affected by the updates (both new
nodes and old nodes reached by new links). Parallelization
does not trivially extend to the Merge function presented in
subsection 4.2; therefore the actual parallel implementation
of DEMON is left as future work. However, different and
simpler Merge functions can be defined to combine the results
provided by the core of the algorithm, thus preserving its
possibility to scale up in a parallel framework.

4.4 Complexity 

We now evaluate the time complexity of our approach.
The DEMON core (Section 4.1) is based on the Label
Propagation algorithm, whose complexity is $O(n + m)$ [21],
where $n$ is the number of nodes and $m$ is the number of
edges. LP is performed once for each node. Let us assume that
we are working with a scale-free network, whose degree
distribution is $p_k = k^{-\alpha}$. This means that there are
$n k^{-\alpha}$ nodes with degree $k$. If $K$ is the maximum
degree, the complexity would be
$\sum_{k=1}^{K} \left( n k^{-\alpha} \times \left( k + \frac{k(k-1)}{2} \right) \right)$,
because for each node of degree $k$ we have an ego network of
$k$ nodes and at worst $\frac{k(k-1)}{2}$ edges. This number
is very small for the vast majority of nodes, since the degree
distribution is right skewed and thus many nodes have degree
1 or 2. We omit the solution of the sum with the integral and
report that the complexity is dominated by a single term,
ending up as $O(n K^{3-\alpha})$. This means that the larger
the exponent $\alpha$, the faster DEMON is: with $\alpha = 3$
we have few super-hubs, for which we basically check the
entire network a few times, and the rest of the nodes add
nothing to the complexity; with $\alpha = 2$ we have many
high-degree nodes and we end up with a higher complexity, but
still subquadratic in terms of nodes (as, with $\alpha = 2$,
$K \ll n$).
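
The effect of the exponent can be explored numerically. The Python sketch below evaluates the sum directly, assuming (as an illustrative reading of the text) that $n k^{-\alpha}$ nodes have degree $k$ and that the work per node is bounded by the size of its ego network, $k + k(k-1)/2$. The function name `ego_workload` and the chosen parameter values are hypothetical.

```python
def ego_workload(n: float, max_degree: int, alpha: float) -> float:
    """Sum over k = 1..K of (n * k^-alpha) * (k + k(k-1)/2)."""
    total = 0.0
    for k in range(1, max_degree + 1):
        nodes_with_degree_k = n * k ** (-alpha)   # assumed node count for degree k
        ego_size = k + k * (k - 1) / 2            # k nodes plus at most k(k-1)/2 edges
        total += nodes_with_degree_k * ego_size
    return total


# Hypothetical setting: 10^6 nodes, maximum degree 1000.
steep = ego_workload(1e6, 1000, alpha=3.0)     # few hubs
shallow = ego_workload(1e6, 1000, alpha=2.0)   # many high-degree nodes
```

With these inputs the steeper distribution yields the smaller total, in line with the observation that a larger exponent makes the core of the algorithm cheaper.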

5. EXPERIMENTS 

We now present our experimental findings. We make use 
of three networked datasets, representing very different phe- 
nomena. We first concentrate on evaluating the quality of 
a set of communities discovered in these datasets, compar- 
ing the results with those of other competing methods in 
terms of the predictive power of the discovered communities. 
Since real world data are enriched with annotated informa- 
tion, we measure the ability of each community to predict 
the semantic information attached with the metadata of the 
nodes within the community itself. 

Next, we assess the community quality using a global
measure of community cohesion, based on the intuition that
nodes in the same community should possess similar
semantic properties in terms of attached metadata.

The selected competitors for our assessment are:
Hierarchical Link Clustering (HLC) [1], which has been shown
to outperform all the overlapping algorithms, including
the k-clique percolation algorithm by Palla et al. [8];
two random walk based methods, one focusing on minimizing
random walk entropy (Infomap [22]) and the other
relying on a general flow method (Walktrap [20]); and a
leading eigenvector-based community discovery method, namely
Modularity maximization in the fast greedy implementation
introduced in [5]. Finally, we present some examples of
knowledge that we are able to extract from the communities
found by the DEMON algorithm.

Note that we are not able to provide the analytic
evaluation for the Amazon dataset: for that network the HLC
algorithm was not able to provide results due to memory
consumption problems, while the other community discovery
algorithms usually returned some huge communities that were
not possible to analyze (see Section 5.2 and particularly
Figure 4 for more information).

The experiments were performed on a Dual Core Intel i7
64 bits @ 2.8 GHz, equipped with 8 GB of RAM and running
a Linux 3.0.0-12-generic kernel (Ubuntu 11.10). The code
was developed in Java and is available for download together
with the network datasets used (footnote 2). As for
performance, we mainly refer to the biggest dataset, i.e.
Amazon: the core of the algorithm (Section 4.1) took less
than a minute, while the Merge function (Section 4.2) with
increasing thresholds can take from one minute to one hour.

            |V|        |E|          k
Congress    526        14,198       53.98
IMDb        56,542     185,347      6.55
Amazon      410,236    2,439,437    11.89

5.1 Networks 

We tested our algorithm on three real world complex
networks extracted from available web services of different
domains. A general overview of the statistics of these
networks can be found in Table 1, where $|V|$ is the number of
nodes, $|E|$ is the number of edges and $k$ is the average
degree of the network. The Congress and IMDb networks are
similar to

2 http://www.di.unipi.it/~coscia/demon/ 



Table 1: Basic statistics of the studied networks. 



the ones used in [1], generally updating the source dataset
with a more recent set of data, and we refer to that paper for
a deeper description of them. The networks were generated
as follows:

Congress. The network of legislative collaborations
between US representatives of the House and the Senate during
the 111th US Congress (2009-2011). We downloaded the data
about all the bills discussed during the last Congress from
GovTrack (footnote 3), a web-based service recording the
activities of each member of the US Congress. Bills are
usually co-sponsored by many politicians. We connect
politicians if they have at least 75 co-sponsorships, and
delete all the connections that are created only by bills with
more than 10 co-sponsors. Attached to each bill in the
GovTrack data we also have a collection of subjects related to
the bill. The set of subjects a politician frequently worked
on is the qualitative attribute of this network.

IMDb. We downloaded the entire database of IMDb from
their official API^4 on August 25th 2011. We focus on actors
who starred in at least two movies during the years from 2001
to 2010, filtering out television shows, video games, and
other performances. We connect two actors if there are at
least two movies in which they both appear. This network is
weighted according to the number of co-appearances. Our
qualitative attributes are the user-assigned keywords
summarizing the movies each actor has been part of.
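
As an illustration, the construction of the IMDb network
described above can be sketched in a few lines of Python. This
is a minimal sketch and not the code used for the experiments:
the function name `build_coappearance_edges`, the `casts` input
format and the toy data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def build_coappearance_edges(casts, min_shared=2):
    """Return {(actor_a, actor_b): weight} for pairs of actors that
    appear together in at least `min_shared` movies.

    `casts` (hypothetical format) maps a movie id to its actors.
    The weight of an edge is the number of co-appearances.
    """
    shared = Counter()
    for actors in casts.values():
        # sorted() gives every pair one canonical (a, b) orientation
        for a, b in combinations(sorted(set(actors)), 2):
            shared[(a, b)] += 1
    return {pair: w for pair, w in shared.items() if w >= min_shared}


# Toy data, invented for illustration only.
casts = {
    "m1": ["ann", "bob", "cat"],
    "m2": ["ann", "bob"],
    "m3": ["bob", "cat"],
    "m4": ["ann", "dan"],
}
edges = build_coappearance_edges(casts)
```

In the toy data only the pairs (ann, bob) and (bob, cat) share
two movies, so only those two weighted edges are kept.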

Amazon. We downloaded Amazon data from the Stanford Large
Network Dataset Collection^5. In this dataset, frequent
co-purchases of products are recorded for the day of May 5th
2003. We transformed the directed network into an undirected
version. We also downloaded the metadata information about
the products, available in the same repository. Using this
metadata, we can define the qualitative attributes for each
product as its categories.
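
The symmetrization step can be sketched as follows. This is
only an illustrative sketch: the function names and the toy
edge list are hypothetical, and discarding self-loops is an
assumption not stated in the text. The last lines check that
the average degree reported in Table 1 for Amazon agrees with
2|E|/|V|, the average degree of an undirected network.

```python
def to_undirected(directed_edges):
    """Collapse directed (u, v) pairs into a set of undirected edges."""
    undirected = set()
    for u, v in directed_edges:
        if u != v:  # assumption: self-loops are discarded
            undirected.add((u, v) if u <= v else (v, u))
    return undirected


def average_degree(num_nodes, num_edges):
    """Average degree of an undirected network: 2|E| / |V|."""
    return 2 * num_edges / num_nodes


# Toy edge list, invented for illustration only.
directed = [(1, 2), (2, 1), (2, 3), (3, 3)]
undirected = to_undirected(directed)

# Amazon row of Table 1: |V| = 410,236 and |E| = 2,439,437.
amazon_k = round(average_degree(410236, 2439437), 2)
```

The two reciprocal arcs between nodes 1 and 2 collapse into a
single undirected edge, and `amazon_k` reproduces the value of
k listed for Amazon in Table 1.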

5.2 Quality Evaluation via Label Prediction 

We first assess DEMON's performance using a classical
prediction task. We attach the community memberships of
a node as known attributes and its qualitative attributes
(real world labels) as targets to be predicted; we then feed
these attributes to a state-of-the-art label predictor and
record its performance. Of course, a node may have one or
more known attributes, as both DEMON and HLC are overlapping
community discovery algorithms; it may also have one or more
unknown attributes, as it can carry many different labels.

For this reason, we need a multi-label classifier, i.e. a
learner able to predict multiple target attributes [24]. We
chose to use the Binary Relevance Learner (BRL). The BRL learns
$|L|$ binary classifiers $H_l : X \to \{l, \neg l\}$, one for
each different label $l \in L$. It transforms the original data
set into $|L|$ data sets $D_l$ that contain all examples of the
original data set,

3 http://www.govtrack.us/developers/data.xpd
4 http://www.imdb.com/interfaces
5 http://snap.stanford.edu/data/index.html
           0.14740   0.00099   0.00725
IMDb       0.44252   0.43078   0.38470   0.10692   0.17488

Table 2: The F-Measure scores for the Congress and IMDb
datasets and each community partition.



labeled as $l$ if the labels of the original example contained
$l$ and as $\neg l$ otherwise. It is the same solution used to
deal with a single-label multi-class problem using a binary
classifier. Note that this classifier does not penalize per se
non-overlapping partitions, as each target label is classified
independently; this property is required to fairly compare
overlapping algorithms, such as DEMON and HLC, with the other
non-overlapping algorithms. We used the Python implementation
provided in the Orange software^6. Because of time and memory
constraints due to the BRL complexity, for IMDb we used as
input only the biggest communities (with more than 15 nodes),
eliminating all nodes that are not part of any of the selected
communities.
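
The data transformation performed by binary relevance can be
sketched in plain Python. This is a minimal illustration of the
idea, not the Orange implementation used in the experiments:
the function name `binary_relevance_datasets`, the input format
and the toy examples are assumptions.

```python
def binary_relevance_datasets(examples, labels):
    """Build one binary data set per label from multi-label examples.

    `examples` is a list of (features, label_set) pairs.
    Returns {label: [(features, target), ...]} where target is True
    if the example carries that label and False otherwise.
    """
    return {
        label: [(x, label in y) for x, y in examples]
        for label in labels
    }


# Toy data: features are community memberships of a node,
# targets are its real world labels (all invented).
examples = [
    ({"c1", "c2"}, {"health", "taxes"}),
    ({"c2"}, {"taxes"}),
    ({"c3"}, set()),
]
datasets = binary_relevance_datasets(examples, ["health", "taxes"])
health_targets = [target for _, target in datasets["health"]]
taxes_targets = [target for _, target in datasets["taxes"]]
```

Each per-label data set keeps every original example, and one
independent binary classifier would then be trained on each of
them; any binary learner can play that role.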

Multi-label classification requires different metrics than
those used in traditional single-label classification. Among
the measures that have been proposed in the literature,
we use the multi-label version of the standard Precision
and Recall measures. Let $D_l$ be our multi-label evaluation
data set, consisting of $|D_l|$ multi-label examples
$(x_i, Y_i)$, $i = 1..|D_l|$, $Y_i \subseteq L$. Let $H$ be our
BRL multi-label classifier and $Z_i = H(x_i)$ be the set of
labels predicted by $H$ for $x_i$. Then, we can evaluate
Precision and Recall of $H$ as:

$$\mathrm{Precision}(H, D_l) = \frac{1}{|D_l|} \sum_{i=1}^{|D_l|} \frac{|Y_i \cap Z_i|}{|Z_i|}$$

$$\mathrm{Recall}(H, D_l) = \frac{1}{|D_l|} \sum_{i=1}^{|D_l|} \frac{|Y_i \cap Z_i|}{|Y_i|}$$




We then derive the F-measure from Precision and Recall.
For alternative multi-label evaluation measures, we refer to
[11]. The results are reported in Table 2 and show how DEMON
outperforms its competitors. We did not test the Amazon
network, as HLC was not able to provide results due to its
complexity and, furthermore, the BRL classifier was not able
to scale to the overall number of nodes and labels.
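
The multi-label Precision and Recall above can be computed
directly from the true label sets Y_i and the predicted label
sets Z_i. The following is a hedged sketch: the function name
and toy data are hypothetical, treating an empty Y_i or Z_i as
contributing zero is an assumption, and the F-measure is taken
to be the usual harmonic mean of Precision and Recall, which
the text does not spell out.

```python
def multilabel_scores(true_sets, predicted_sets):
    """Example-based Precision, Recall and F-measure for multi-label output."""
    n = len(true_sets)
    precision = recall = 0.0
    for y, z in zip(true_sets, predicted_sets):
        hits = len(y & z)
        # assumption: an empty denominator contributes 0 to the average
        precision += hits / len(z) if z else 0.0
        recall += hits / len(y) if y else 0.0
    precision /= n
    recall /= n
    total = precision + recall
    # assumption: F-measure is the harmonic mean of the two averages
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy data, invented for illustration only.
true_sets = [{"a", "b"}, {"c"}]
predicted_sets = [{"a"}, {"c", "d"}]
p, r, f = multilabel_scores(true_sets, predicted_sets)
```

In the toy data the first example has per-example precision 1
and recall 1/2, the second has precision 1/2 and recall 1, so
both averages and the F-measure come out at 0.75.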

For the IMDb dataset, HLC was able to score almost as well as
DEMON. However, there is an important distinction to be
made about the quantity of the results: if the community
discovery returns too many communities, then it is difficult
to actually extract useful knowledge from them. We report
in Table 3 the basic statistics about the partitions
returned by the algorithms: number of communities ($|C|$)
and average community size ($|\bar{c}|$). For DEMON, we report
the statistics of the communities extracted with
$\epsilon = 0$. As we can see, not only does DEMON score better
results, but it does so with 70-80% fewer communities than HLC
and with an average community size more manageable than that
of Infomap.

We report in Table 3 the results for $\epsilon = 0$. However,
we can also vary the $\epsilon$ threshold and see what happens
to the number of communities and to the quality of the
results. We can see that for both Congress and IMDb the
Precision, Recall and F-Measure scores remain nearly constant
(and actually F-Measure peaks at $\epsilon = 0.076$ and
$\epsilon = 0.301$ for Congress

6 http://orange.biolab.si/ 
Figure 3: Precision, Recall, F-Measure and number of
communities for different $\epsilon$ values.

Figure 4: The distribution of the community sizes for
DEMON and Infomap in the Amazon network.



and IMDb respectively) before falling for increasing
$\epsilon$ values, while the relative number of communities
drops dramatically. For Congress, we have the maximum
F-Measure with only 175 communities, while for IMDb the
F-Measure peaks with 6,508 communities (in both cases, less
than 50% of the communities at $\epsilon = 0$ and an order of
magnitude fewer than HLC).

A final consideration is needed about the size distribution
of the communities detected by DEMON and the other
community discovery algorithms. In Figure 4 we depict
the community size distribution for DEMON and Infomap
for the Amazon network. As we can see, Infomap returned,
among the others, a giant community, one order of magnitude
larger than the biggest one returned by DEMON. Analyzing
such a community proved impossible, and this is another
reason why we are not able to provide an analytical
evaluation of the results extracted from the Amazon network.

We can conclude that DEMON, with a manageable number of
communities, is able to outperform more complex methods, and
that the choice of $\epsilon$ can make the difference in
obtaining a narrower set of communities with the same (or
greater) overall quality.

5.3 Quality Evaluation via Community Cohesion

As presented at the beginning of this section, the networks
studied here possess qualitative attributes that attach a
5,991   27.1574
3       175.3333
4,746   11.9157

Table 3: Statistics of the community set returned by the different algorithms.
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5.1589 


0.1400 


1.4652 


0.0211 



Table 4: The Community Quality scores for Congress and 
IMDb dataset and each community partition. 

small set of annotations or tags to each node. Assuming
that these qualitative attributes form a description of the
node beyond the network itself, we can reasonably state
that "similar" nodes share more qualitative attributes than
dissimilar nodes. This procedure is not standard in the
evaluation of community discovery results. Usually authors
prefer to use the established measure of Modularity. However,
Modularity is strictly (and exclusively) dependent on the
graph structure. What we want to evaluate is not how well a
graph measure is maximized, but how good our community
partition is at describing real world knowledge about the
clustered entities.

We quantify the match between a community partition and the
metadata by evaluating how much higher, on average, the
Jaccard coefficient of the sets of qualitative attributes is
for pairs of nodes inside the communities than for the entire
network, or:

$$CQ(P) = \frac{\sum_{(n_1,n_2) \in P} \frac{|QA(n_1) \cap QA(n_2)|}{|QA(n_1) \cup QA(n_2)|}}{\sum_{(n_1,n_2) \in E} \frac{|QA(n_1) \cap QA(n_2)|}{|QA(n_1) \cup QA(n_2)|}}$$

where $P$ is the set of node pairs that share at least one
community, $QA(n)$ is the set of qualitative attributes of
node $n$ and $E$ is the set of all edges. If $CQ(P) = 1$, then
there is no difference between $P$ and the average similarity
of nodes, i.e. $P$ is practically random. Lower values imply
that we are grouping together dissimilar nodes; higher values
are expected for an algorithm able to group together similar
nodes.
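
A small Python sketch may help to make the CQ measure concrete.
It is an illustration under stated assumptions, not the
evaluation code of the paper: all names and the toy data are
hypothetical, the Jaccard coefficient of two empty sets is
taken to be 0, and each sum is divided by its number of pairs,
following the "on average" wording of the text, so that CQ = 1
corresponds to the random case.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0 if both are empty: an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_jaccard(pairs, attrs):
    """Average Jaccard coefficient of the attribute sets over node pairs."""
    pairs = list(pairs)
    return sum(jaccard(attrs[u], attrs[v]) for u, v in pairs) / len(pairs)


def community_quality(community_pairs, edges, attrs):
    """Ratio between the average similarity of pairs sharing a community
    and the average similarity of pairs connected by an edge."""
    return mean_jaccard(community_pairs, attrs) / mean_jaccard(edges, attrs)


# Toy data, invented for illustration only.
attrs = {
    "n1": {"x", "y"},
    "n2": {"x", "y"},
    "n3": {"x"},
    "n4": {"z"},
}
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4")]
community_pairs = [("n1", "n2"), ("n1", "n3"), ("n2", "n3")]
cq = community_quality(community_pairs, edges, attrs)
```

Here the pairs inside the community have an average Jaccard
coefficient of 2/3 against 1/2 over the edges, so the score is
above 1, as expected when similar nodes are grouped together.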

Calculating the Jaccard coefficient for each pair of nodes in
the network is computationally prohibitive. Therefore, for
IMDb we chose a random set of 400k pairs. Moreover, CQ is
biased towards algorithms returning more communities. For
this reason, we collected random communities from the
community pool, trying to avoid too much overlap, as we also
want to maximize the number of nodes considered by CQ (i.e.
we try not to consider more than one community per node). We
applied this procedure to each algorithm and calculated the
CQ value. We repeated this process for 100 iterations and we
report in Table 4 the average value of the CQ obtained. In
this case too, DEMON was able to easily outperform all the
other algorithms.

5.4 Interpretation of Discovered Communities 

In this section we present a brief case study using the
communities extracted for the evaluation of DEMON presented
above. The aim of the section is to demonstrate that
the extracted communities have practical applications in the
extraction of knowledge from real world scenarios. In the




Figure 5: A representation of parts of the two communities
surrounding our case study in the Amazon network.

Amazon network, having different communities for each item
is very useful. A recommendation system is better able to
discern whether a user may be interested in a product, given
that they bought something else; however, being part of one
community of products does not mean that that particular
community describes all aspects of a particular product.

Let us consider, as an example, the case of Jared Diamond's
best selling book "Guns, Germs, and Steel: The Fates of Human
Societies". Clearly, it is difficult to say that the people
interested in this book are always interested in the same
things. Checking the communities to which it belongs, we find
two very different big communities (a depiction of the two
communities is provided in Figure 5). These communities have
some overlap; however, they can be characterized by looking
at the products that appear exclusively in one or in the
other. In the first one we find books such as: "Beyond the
State: An Introductory Critique", "The Econometrics of
Corporate Governance Studies" and "The Transformation of
Governance: Public Administration for Twenty-First Century
America". This is clearly a community composed mainly of
purchases made by the people more interested in the
socioeconomic aspects of Diamond's book. The second community
hosts products such as: "An Introduction to Metaphysics",
"Aristophanes' Clouds Translated With Notes and Introduction"
and "Being and Dialectic: Metaphysics and Culture". This
second community is apparently composed of the purchases of
customers more attracted by the underlying philosophical
implications of Diamond's study. Products in the two
communities may have something in common, but they are part
of two distinct and very well characterized groups, and the
ones in one group are not expected to be found in the other.

This is, of course, only one of many such cases. We report as
an additional example the two communities around the
historical novel "The Name of the Rose" by Umberto Eco: one
community is characterized by history related products (such
as "Ancestral Passions: The Leakey Family and the Quest for
Humankind's Beginnings"), the other by costume fiction (for
example Dreyer's 1932 movie "Vampyr").



6. CONCLUSION AND FUTURE WORKS 

In this paper we proposed a new method for solving the
problem of detecting latent knowledge from significant
communities in complex networks. We propose a democratic
approach, where the peer nodes judge where their neighbors
should be clustered together. This approach has robust
theoretical properties, including correctness and
completeness w.r.t. a precise community definition,
determinacy, compositionality and incrementality, that make
it suitable for tackling the conceptual and computational
challenge of analyzing networks with millions of nodes. We
have shown in the experimental section that this method
allows the discovery of communities in different real world
networks collected from information rich datasets. The
quality of the overlapping partition, a partition that allows
nodes to be in different communities at the same time, is
improved w.r.t. state-of-the-art algorithms, both when
evaluated by using the communities to predict the metadata
attached to the nodes and according to a quantitative quality
function, also metadata-based.

Many lines of research remain open for future work, such
as an efficient parallel implementation that may make DEMON
the first community discovery algorithm able to scale to
billions of nodes; different merging strategies that may
further improve the quality of the results; and different
hosted algorithms that can be used instead of the Label
Propagation algorithm in the inner loop of DEMON, to extract
communities according to different definitions.
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